Distributed Control by Lagrangian Steepest Descent 
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Often adaptive, distributed control can be viewed as an iterated game between independent 
players. The coupling between the players' mixed strategies, arising as the system evolves from one 
instant to the next, is determined by the system designer. Information theory tells us that the most 
likely joint strategy of the players, given a value of the expectation of the overall control objective 
function, is the minimizer of a Lagrangian function of the joint strategy. So the goal of the system 
designer is to speed evolution of the joint strategy to that Lagrangian minimizing point, lower the 
expectated value of the control objective function, and repeat. Here we elaborate the theory of 
algorithms that do this using local descent procedures, and that thereby achieve efficient, adaptive, 
distributed control. 
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I. INTRODUCTION 

This paper considers the problem of adaptive dis- 
tributed control 0, 0i I3- Typically in such problems, 
at each time t each control agent i sets its state x\ inde- 
pendently of the other agents, by sampling an associated 
distribution, ql[x\). Rather than directly via statisti- 
cal dependencies of the agents' states at the same time i, 
the coupling between the agents arises indirectly, through 
the stochastic joint evolution of their distributions 
across time. 

More formally, let time be discrete, where at the be- 
ginning of each t every agent sets its state ("makes its 
move"), and then the rest of the system responds. Indi- 
cate the state of the entire system at time t as z*. (z* 
includes well as all stochastic elements not being di- 
rectly controlled.) So the joint distribution of the moves 
of the agents at any moment t is given by the product dis- 
tribution q^{x^) = ql{x\), and the state of the entire 
system, given joint move a;*, is governed by P(z* | x*). 

Now in general the observations by agent i of aspects 
of the system's state at times previous to t will determine 
g*. So q\ is statistically dependent on the previous states 
of the entire system, z^* <*h In other words, the agents 
can be viewed as players in a repeated game with Nature, 
each playing mixed strategies {q\} at moment t 0,0, IE 
0, 01 • Their interdependence arises through information 
sets and the like, in the usual way. 

From this perspective what the designer of a dis- 
tributed control system can specify is the stochastic 
laws governing the updating of the joint strategy. In 
other words, the designer wishes to impose a stochas- 
tic dynamics on a Multi-Agent System (MAS) that op- 
timizes an overall objective function of the state of the 
system in which the MAS is embedded, J^(z).j32| For- 
mally, this means inducing a joint strategy q{x) with a 
good associated value of Eq{F) = / dxq{x)E{F \ x) = 
/ dxq(x)G(x) . ^\ Once such a q is found, one can sample 
it to get a final x, and be assured that, on average, the 
associated F value is low. G is called the world utility. 



In this paper we elaborate a set of algorithms that 
iteratively update g* in such a manner. The algo- 
rithms presented here are based on using steepest descent 
techni ques to minimize a G-parameterized Lagrangian, 
iG(9)-_34| Because the descent is over Euclidean vectors 
q, these algorithms can be applied whether the Xi are cat- 
egorical, continuous, time-extended, or a mixture of the 
three. So in particular, they provide a principled way to 
do "gradient descent over categorical variables" . 

In the next section we first derive the Lagrangian 
Lclq) and discuss some of its properties. 

In the following section we show how to apply gradi- 
ent descent (and its embellishments) to optimization of 
the Lagrangian. If we view the agents as engaged in a 
team game, all having the same utility G, then this gra- 
dient descent is a distributed scheme for each agent to 
update its strategy, in a way that will steer the game to 
a bounded rational equilibrium 9, 14]. 

In this section we also consider second order methods. 
In contrast to gradient descent, in general any single ap- 
plication of Newton's method to update a product dis- 
tribution q will result in a new distribution p{q) that is 
not a product distribution. So we must instead solve for 
the product distribution q'{p) having minimal KuUback- 
Leibler distance to p. In this section we derive the rule 
for iterative updating of our distribution so as to move q 
in the direction of q'{p{q)). 

In practice any local descent scheme often requires 
Monte Carlo sampling to estimate terms in the gradient. 
To minimize the expected quadratic error of the estima- 
tion, typically the game is changed from being a team 
game. In other words, in general changing the agent's 
utilities gi to not all equal G will result in lower bias plus 
variance of the estimation of the gradient, and therefore 
will speed evolution to a good joint strategy. These and 
other techniques for shrinking bias plus variance are dis- 
cussed in [13, llil ■ 

We end this section by mentioning some other tech- 
niques for improving the Monte Carlo sampling. These 
include data-aging, and techniques for managing the de- 
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scent when it gets close to a border of the (simplex of) 
allowed q, Q. Most of these techniques introduced can 
be used even with schemes for minimizing Lc^q) other 
than gradient descent. 

In the final section we introduce some alternatives to 
Lciq), designed to help speed convergence to a q with 
low E{G). Miscellaneous proofs can be found in the ap- 
pendix. 

The general mathematical framework for casting con- 
trol and optimization problems in terms of minimizing 
Lagangians of probability distributions is known as the 
theory of probability Lagangians. The precise version 
where the probability distributions are product distri- 
butions is known as "Product Distribution" (PD) the- 
ory. Tl'l. It has many deep connections to other fields, 
including bounded rational game theory and statistical 
physics [iJI- As such it serves as a mathematical bridge 
connecting these disciplines. Some initial experimental 
results concerning the use of PD theory for distributed 
optimization and distributed control can be found in 
[H El m El El 113. See [H El EH for other uses 
and extensions of PD theory. 



II. PRODUCT DISTRIBUTION LAGRANGIANS 
A. The maxent Lagrangian 

Say the designer stipulates a particular desired value 
of E{G), 7. For simplicity, consider the case where the 
designer has no other knowledge concerning the system 
besides 7 and the fact that the joint strategy is a product 
distribution. Then information theory tells us that the 
a priori most likely q consistent with that information is 
the one that maximizes entropy subject to that informa- 
tion 22, 23, 24). 35] In other words, of all distributions 
that agree with the designer's information, that distribu- 
tion is the "easiest" one to induce by random search. 

Given this, one can view the job of the designer of a 
distributed control system as an iterative equilibration 
process. In the first stage of each iteration the designer 
works to speed evolution of the joint strategy to the q 
with maximal entropy subject to a particular value of 7. 
Once we have found such a solution we can replace the 
constraint — replace the target value of E{G) — with 
a more difficult one, and then repeat the process, with 
another evolution of q pjj . 

Define the maixent Lagrangian by 

Liq) ^ I3{E,{G)~^)^S 

= (3{[ dxq{x)G{x) - 7) - 5(<z), (1) 



where S{q) is the Shannon entropy of g, 
— / dx(?(a;)ln^^, and for simplicity we here take 
the prior ji to be uniform. [3fil|. Writing it out, for a given 
7, the associated most likely joint strategy is given by 
the q that minimizes L[q) over all those (f^j/S) such that 



the Lagrange parameter /3 is at a critical point of L, i.e., 
such that f| = 0. 

Solving, we find that the qi are related to each other 
via a set of coupled Boltzmann equations (one for each 
agent i). 
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where the overall proportionality constant for each i is 
set by normalization, the subscript Qj'^^ on the expecta- 
tion value indicates that it is evaluated according to the 
distribution H j^ti 1i ■• ^^"^ P ^o enforce the condition 
Eqi3{G) = 7. Following Nash, we can use Brouwer's fixed 
point theorem to establish that for any fixed /3, there 
must exist at least one solution to this set of simultane- 
ous equations. 

If we evaluate E{G) at the solution q^ ^ we find that it 
is a declining function of (3. So in following the iterative 
procedure of equilibrating and then lowering 7 we we will 
raise (3. Accordingly, we can avoid the steps of testing 
whether each successive constraint E(G) = 7 is met, and 
simply monotonically increase [3 instead. This allows us 
to avoid ever explicitly specifying the values of 7. 

Simulated annealing is an example of doing this, where 
rather than work directly with one works with random 
samples of it formed via the Metropolis random walk 
algorithm '2^, "2?, '28'! . There is no a priori reason to 
use such an inefficient means of manipulating q however. 
Here we will work with q directly instead. This will result 
in an algorithm that is not simply "probabilistic" in the 
sense that the updating of its variables is stochastic (as 
in simulated annealing). Rather the very entity being 
updated is a probability distribution. 



B. Shape of the mcLxent Lagrangian 

Consider L as a function of q, with (3 and 7 both 
treated as fixed parameters. (So in particular, Eq{g) need 
not equal 7.) First, say that is also held fixed, with 
only qi allowed to vary. This makes E[g) linear in qi. 
In addition, entropy is a concave function, and the unit 
simplex is a convex region. Accordingly, the Lagrangian 
of Eq.^has a unique local minimum over qi. So there is 
no issue of choosing among multiple minima when all of 
(7(j) is fixed. Nor is there any problem of "getting trapped 
in a local minimum" in a computational search for that 
minimum. Indeed, in this situation we can just jump 
directly to that global optimum, via Eq. [3 

Now introduce the shorthand for any function U{x), 



[U]i^p{xi) = J dx(i)U{xi,X(i))p(x(^ I Xi). 

So [G\i^q^.^{xi) is agent Vs "effective" cost function, 
Eq^^^(G I Xi). Consider the value E^ii {[G]i,q^.^). This is 

the value of E(G) at i's bounded rational equilibrium for 
the fixed (/(^^ , i.e., it is the value at the minimum over qi of 
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L. View that value as a function of /3. One can show that 
this is a decreasing function. In fact, its derivative just 
equals the negative of the variance of [GJi^^^.j {xi) evalu- 
ated under distribution q^{xi) (see appendix). Combin- 
ing this with the fact that E{G) is bounded below (for 
bounded G) , establishes that the variance must go to zero 
for large enough p. So as (3 grows, q^ixi) for all Xi 
that don't minimize Eg^..^ (G \ Xi). In other words, in that 
limit, qi becomes Nash-optimal. 

Next consider varying over all q G Q, the space of 
all product distributions q. This is a convex space; if 
p G Q and p' £ Q, then so is any distribution on the line 
connecting p and p'. However over this space, the E{G) 
term in L is multilinear. So L is not a simple convex 
function of q. So we do not have guarantees of a single 
local minimum. 

The following lemma extends the technique of La- 
grange parameters to off-equilibrium points: 

Lemma 1: Consider the set of all vectors leading from 
x' e R" that are, to first order, consistent with a set 
of constraints over M", {fi{x) =0}. Of those vectors, 
the one giving the steepest ascent of a function V{x) is 
u ~ W + fi, up to an overall proportionality 

constant, where the enforce the first order consistency 
conditions, m • V/i = Vi. 

Now examine the derivatives of S{q) with respect to 
all components of q, i.e., the g-gradient of the entropy. 
At the border of Q, at least one of the \n{qi) terms in 
those derivatives will be negative infinite. Combined with 
Lemma 1, this can be used to establish that at the edge 
of Q, the steepest descent direction of any player's La- 
grangian points into the interior of Q (assuming finite /3 
and {G}). (This is reflected in the equilibrium solutions 
Eq. 121) Accordingly, whereas Nash equilibria can be on 
the edge of Q (e.g., for a pure strategy Nash equilibrium), 
in bounded rational games any equilibrium must lie in the 
interior of Q. In other words, any equilibrium (i.e., any 
local minimum) of a bounded rational game has non-zero 
probability for all joint moves. So just as when only vary- 
ing a single qi , we never have to consider extremal mixed 
strategies in searching for equilibria over all Q. We can 
use local descent schemes instead 

Lemma 1 can also be used to construct G with more 
than one solution to Eq. El One can also show that for 
every player i and any point q interior to Q, there are di- 
rections in Q along which i's Lagrangian is locally convex. 
Accordingly, no player's Lagrangian has a local maximum 
interior to Q. So if there are multiple local minima of «'s 
Lagrangian, they are separated by saddle points across 
ridges. In addition, the uniform g is a solution to the set 
of coupled equations Eq. |21 but typically is not a local 
minimum, and therefore must be a saddle point. 

Say that we were not restricting ourselves to prod- 
uct distributions. So the Lagrangian becomes L{p) = 
(3{Ep{G) — 7) — S{p), where p can now be any distribu- 
tion over X. There is only one local minimum over p of 



this Lagrangian, the canonical ensemble: 

/(x)oce-^«(-) 

In general p^ is not a product distribution. However we 
can ask what product distribution is closest to it. 

Now in general, the proper way to approximate a tar- 
get distribution p with a distribution from a subset C 
of the set of all distributions is to first specify a misfit 
measure saying how well each member of C approximates 
p, and then solve for the member with the smallest mis- 
fit. This is just as true when C is the set of all product 
distributions as when it is any other set. 

How best to measure distances between probability dis- 
tributions is a topic of ongoing controversy and research 
|30j |. The most common way to do so is with the infi- 
nite limit log likelihood of data being generated by one 
distribution but misattributed to have come from the 
other. This is know as the Kullback-Leibler distance 
[22, 23, 31]: 

KL{p^\\p,) = S{p,\\p2)-S{p,) (3) 

where S{pi \\ P2) = ~ J dx pi(a;)ln[£^^] is known as 
the cross entropy from pi to p2 (and as usual we im- 
plicitly choose uniform fi). The KL distance is always 
non-negative, and equals zero iff its two arguments are 
identical. 

As shorthand, define the "pq distance" as KL{p \ \ q), 
and the "gp distance" as KL{q \ \ p), where p is our tar- 
get distribution and q is a, product distribution. Then 
it is straightforward to show that the qp distance from q 
to target distribution p^ is just the maxent Lagrangian 
L[q)^ up to irrelevant overall constants. In other words, 
the q minimizing the maxent Lagrangian is q with the 
minimal qp distance to the associated canonical ensem- 
ble. 

However the qp distance is the (infinite limit of the 
negative log of) the likelihood that distribution p would 
attribute to data generated by distribution q. It can be 
argued that a better measure of how well q approximates 
p would be based on the likelihood that q attributes to 
data generated by p. This is the pq distance; it gives a 
different Lagrangian from that of Eq. ^ 

Evaluating, up to an overall additive constant (of the 
canonical distribution's entropy) , the pq distance is 

KL(j) \ \q) = - ^ j dx p{x)\a\q^{xi)]. 

i 

This is equivalent to a game where each coordinate i has 
the "Lagrangian" 

L*{q) ^ - j dx^p.,{xi)\n[q^{,% (4) 

where Pi (xi) is the marginal distribution J dx(^{jp{x). The 
minimizer of this is just qi = Pi Vi, i.e., each qi is set to 
the associated marginal distribution of p. 

In the interests of space, the rest of this paper we 
restrict attention to the pq KL distance and associated 
maxent Lagrangian. 
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III. DESCENT OF THE MAXENT 
LAGRANGIAN 

A. Gradient descent 

Consider the situation where each Xi can take on a 
finite number of possible values, Say we are iter- 

atively evolving q to minimize L for some fixed /?, and 
are currently at some point q E Q. Using Lemma 1, we 
can evaluate the direction from q within Q that, to first 
order, will result in the largest drop in the value of L{q): 



dL{q) 



dqiix, = j) 



(5) 



where Ui{j) = [3E{G \ Xi = j) + ln[qi{j)]. (Intuitively, 
the reason for subtracting Ui{x^)/\£_i\ is to keep the 
distribution in the set of all possible probability distribu- 
tions over X, P.) 

Eq.|21specifies the change that each agent should make 
to its distribution to have them jointly implement a step 
in steepest descent of the maxent Lagrangian. These up- 
dates are completely distributed, in the sense that each 
agent's update at time t is independent of any other 
agents' update at that time. Typically at any t each 
agent i knows qi{t) exactly, and therefore knows ln[(7i(j)]. 
However often it will not know G and/or the (7(j). In such 
cases it will not be able to evaluate the E{G \ Xi — j) 
terms in Eq. [Sjin closed form. 

One way to circumvent this problem is to have those 
expectation values be simultaneously estimated by all 
agents by repeated Monte Carlo sampling of q to pro- 
duce a set of {x,G{x)) pairs. Those pairs can then be 
used by each agent i to estimate the values E[G \ Xi — j), 
and therefore how it should update its distribution. In 
the simplest version of such an update to q only occurs 
once every L time-steps. In this scheme only the samples 
(a;, G{x)) formed within a block of L successive time-steps 
are used at the end of that block by the agents to update 
their distributions (according to Eq. [SJ . 



B. Higher order descent schemes 

In general, second order descent (e.g., Newton's 
method) of the maxent Lagrangian is non-trivial, due 
to coupling that arises between the agents and the re- 
quirement for associated matrix inversion. However re- 
call that one way to motivate the entropic product dis- 
tribution Lagrangian L{q) starts by saying that what we 
really want is the fully coupled canonical ensemble dis- 
tribution, p^{x) oc exp{—/3G{x)). L{q) then measures 
qp KL-distance to that desired distribution. From this 
perspective, any given iteration of second order descent 
of the maxent Lagrangian runs downhill on a quadratic 
approximation to a distribution, a distribution that is it- 
self a product distribution approximation to the ultimate 
distribution we want to minimize. 



This suggests the alternative of making the approxi- 
mations in the opposite order. In this approach we first 
make a quadratic approximation (over the space of all 
p, not just all q) to the maxent Lagrangian, L{p). Via 
Newton's method this specifies a p* that minimizes that 
quadratic approximation. We can then find the product 
distribution that is nearest (in pq KL distance) to p*. 
This scheme is called Nearest Newton descent. 

The gradient and Hessian of L{p) are given by 



dp{x) ' 



= l3G{x) + l + \n{p°{x)) 



dp{x)dp{x') 



.i\ \p=p" 



pO{x) 



where p'^ is the current point. This Hessian is positive- 
definite (given that the current p is a member of V). 
By simple Lagrange parameters, the general solution is 
either on the border of T', or if in the interior is given by 

p*{x) = -/(x) [(3Gix)+\n{pix))+\] 



where A is set by normalization. Solving, either p* is on 
the edge of the simplex, or 



pO(x) 



1 - Sip') -Hp^ix)) ~ PlGix) - E{G)]. 



Note that the right-hand side is exactly the direction 
you should go using (simplex-constrained) gradient de- 
scent of L{p). So the direction to p* from p'^ is given by 
the Hadamard product of p*^ and the direction given by 
gradient descent. 

Now we can approximate p* with the product distri- 
bution having the minimal KL distance to it. In partic- 
ular, consider using pq KL distance rather than qp KL 
distance. Recall that for this kind of KL distance, the op- 
timal product distribution approximation to a joint dis- 
tribution is given by the product of the marginals of that 
joint distribution (see the discussion just below Eq. 
Say that is in the form of a product distribution, g'', 
i.e., that we are starting from a product distribution. 
Then calculating the marginals of the associated p* to 
get q* is trivial: 



S{q'^)-HqUj)) 
(}[E{G I X, = j) - E{G)] 



(6) 



Note that since the original quadratic approximation 
was over the full joint space, this formula automatically 
takes into account inter-agent couplings. In practice of 
course, it may make sense not to jump all the way from 

to q* , but only part-way there, to be conservative. (In 
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fact, if q* isn't in the interior of the simplex, such partial 
jumping is necessary.) One potential guide to how far to 
jump is the pq KL distance from p* to Yli Q* ■ Unlike the 
KL distances to the full joint Boltzmann distribution, we 
can readily calculate this KL distance. 

The conditional expectations in Nearest Newton are 
the same as those in gradient descent. Accordingly, they 
too can be estimated via Monte Carlo sampling, if need 
be. It's also worth noting that Eq. |Blhas the same form 
as one would get by evaluating the Hessian of the maxent 
Lagrangian, so long as one ignored inter-agent aspects of 
that Hessian. 



IV. OTHER LAGRANGIANS FOR FINDING 
MINIMA OF G 

There are many alternative Lagrangians to the ones 
described above. The section focuses on such alterna- 
tive Lagrangians for the purpose of finding argmin2;G(a;). 
Two classes of such Lagrangians are investigated: vari- 
ants of the Maxent Lagrangians, and variants of the two 
types of KL-distance Lagrangians. 



A. Maxent Lagrangians 



C. Practical issues 

In practice, the block-wise Monte Carlo sampling to 
estimate descent directions described above can be pro- 
hibitively slow. The estimates typically have high vari- 
ance, and therefore require large block size L to get a 
good descent direction. One set of ways to address this 
is to replace the team game with a non-team game, i.e., 
for each agent i have it estimate quantities E(gi \ Xi = j) 
rather than E{G \ Xi = j), where each private util- 
ity Qi is chosen to have much lower variance than G 

Another useful technique is to allow samples from pre- 
ceding blocks to be re-used. One does this by first "ag- 
ing" that data to reflect the fact that it was formed under 
a different qj^) . For example, one can replace the empir- 
ical average for the most recent block k, 

l^t=kL 

with a weighted average over the expected G"s of all pre- 
ceding blocks, 

E„G.,,-(m)e-('-'") 

m 

for some appropriate aging constant Ac.|38j| 

Typically such ageing allows L to be vastly reduced, 
and therefore the overall minimization of L to be greatly 
sped up. For such small L though, it may be that the 
most recent block has no samples of some move Xi = j. 
This would mean that Gij{k) is undefined. One crude 
way to avoid such problems is to simply force a set of 
samples of each such move if they don't occur of their own 
accord, being careful to have the formed by sampling 
when forming those forced samples. 

Other useful techniques allow one to properly decrease 
the step size as one nears the border of Q. 



Say that after finding the q that minimizes the La- 
grangian, we IID sample that g, K times. We then take 
the sample that has the smallest G value as our guess for 
the X that minimizes G{x). For this to give a low x we 
don't need the mean of the distribution q{G) to be low 
— what we need is that the bottom tail of that distribu- 
tion is low. This suggests that in the E{G) term of the 
Maxent Lagrangian we replace 

q{x) -> 

^ Q[k-J dx' q{x')e[G{x) - G{x')]] 

where O is the Heaviside theta function. This new mul- 
tiplier of G is still a probability distribution over x. It 
equals if G{x) is in the worst 1 — k percentile (accord- 
ing to distribution q) of G values, and otherwise. So 
under this replacement the E(G) term in the Lagrangian 
equals the average of G restricted to that lower K'th per- 
centile. For K = K~^, our new Lagrangian forces atten- 
tion in setting q on that outlier likely to come out of the 
K-iold sampling of q{G). 

One can use gradient descent and Monte Carlo sam- 
pling to minimize this Lagrangian, in the usual way. Note 
that the Monte Carlo process includes sampling the prob- 

un-t- A- i- -U ©[«-/ dx' q(x')e[G{x)-G{x')\\ „ 

ability distribution — — — — ^ — — as well as 

the qi. This means that only those points in the best 
K'th percentile are kept, and used for all Monte Carlo 
estimates. This may cause greater noise in the Monte 
Carlo sampling than would be the case for k = 1. 

As an example, say that for agent i, all of its moves 
have the same value of E{G | a;^), and similarly for agent 
j, and say that G is optimal if agents i and j both make 
move 0. Then if we modify the updating so that agent 
i only considers the best values that arose when it made 
move 0, and similarly for agent j, then both will be 
steered to prefer to make move to their alternatives. 
This will cause them to coordinate their moves in an op- 
timal manner. 

A similar modification is to replace G with /(G) 
in the maxent Lagrangian, for some concave nowhere- 
decreasing function /(.). Intuitively, this will have the 
effect of coordinating the updates of the separate qi at 
the end of the block, in a way to help lower G. It does 
this by distorting G to accentuate those x's with good 
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values. The price paid for this is that there will be more 
variance in the values of /(G) returned by the Monte 
Carlo sampling than those of G, in general. 

Note that if g is a local minimum of the Lagrangian 
for G, in general it will not be a local minimum for the 
Lagrangian of /(G) (the gradient will no longer be zero 
under that replacement, in general). So we can replace 
G with /(G) when we get stuck in a local minimum, 
and then return to G once q gets away from that local 
minimum. In this way we can break out of local minima, 
without facing the penalty of extra variance. Of course, 
none of these advantages in replacing G with /(G) hold 
for algorithms that directly search for an x giving a good 
G(x) value; a; is a local minimum of G{x) <^=> x is a local 
minimum of f{G{x)). 

An even simpler modification to the E{G) term than 
those considered above is to replace G{x) with Q[G[x) — 
K]. Under this replacement the E{G) term becomes the 
probability that G{x) > K. So minimizing it will push q 
to X with lower G values. For this modified Lagrangian, 
the gradient descent update steps adds the following to 
each qi{xi): 

a[(3q{G <K\xi) + \n{q,{xi)) 

_ T..,(iq{G<K\x',) + Hq,{x[)) 

In gradient descent of the Maxent Lagrangian we must 
Monte Carlo estimate the expected value of a real number 
(G). In contrast, in gradient descent of this modified 
Lagrangian we Monte Carlo estimate the expected value 
of a single bit: whether G exceeds K. Accordingly, the 
noise in the Monte Carlo estimation for this modified 
Lagrangian is usually far smaller. 

In all these variants it may make sense to replace the 
Heaviside function with a logistic function or an exponen- 
tial. In addition, in all of them the annealing schedule 
for K can be set by periodically searching for the K that 
is (estimated to be) optimal, just as one searches for op- 
timal coordinate systems (11.. 14] . Alternatively, a simple 
heuristic is to have K at the end of each block be set so 
that some pre-fixed percentage of the sampled points in 
the block go into our calculation of how to update q. 

Yet another possibility is to replace E{G) with the 
Ac'th percentile G value, i.e., with the K such that 
/ dx' q{x')Q{G{x') — K) — K. (To evaluate the partial 
derivative of that k with respect a particular qi{xi) one 
must use implicit differentiation.) 



B. KL- based Lagrangians 

Both the gp-KL Lagrangian and pq-KL Lagrangians 
discussed above had the target distribution be a Boltz- 
mann distribution over G. For high enough /3, such a 
distribution is peaked near argmina;G(x). So sampling 
an accurate approximation to it should give an x with 



low G, if (3 is large enough. This is why one way to min- 
imize G is to iteratively find a q that approximates the 
Boltzmann distribution, for higher and higher (3. 

However there are other target distributions that are 
peaked about minimizers of G. In particular, given any 
distribution p', the distribution 

^ p{x)Q[K-G{x)] 
' ~ j dx' p{x')Q[K - G{x')] 

is guaranteed to be more peaked about such minimizers 
than is p. So our minimization can be done by iterating 
the process of finding the q that best approximates Op and 
then setting p = q. This is analogous to the minimization 
algorithm considered in previous sections, which iterates 
the process of finding the q that best approximates the 
Boltzmann distribution and then increases p. 

For the choice of pq-KL distance as the approximation 
error, the q that best approximates 9p is just the product 
of the marginal distributions of 9p. So at the end of each 
iteration, we replace 

/rfx-,)g[,)(4))e[X-G(x„4))] 
^^"^'^ ^ Jdx' q'ix')e[K ~G{x')] 
q'jG <K\x,) 

q'{G < K) 
q'{x, \G<K) 

where q' is the product distribution being replaced. The 
denominator term in this expression is known exactly to 
agent z, and the numerator can be Monte-Carlo estimated 
by that agent using only observed G values. So like gra- 
dient descent on the Maxent Lagrangian, this update rule 
is well-suited to a distributed implementation. 

Note that if we replace the Heaviside function in this 
algorithm with an exponential with exponent /3, the up- 
date rule becomes 

where both expectations are evaluated under q' , the dis- 
tribution that generated the Monte Carlo samples. It's 
interesting to compare this update rule with the par allel 
Brouwer update rule for the team game prl. IT^TT^. to 
which it is very similar. [s^ This update is guaranteed to 
optimize the associated Lagrangian, unlike the Brouwer 
update. On the other hand, since it is based on the pq- 
KL Lagrangian, as mentioned above there is no formal 
guarantee that this alternative to Brouwer updating will 
decrease E{G). 

This update rule is also very similar to the adaptive 
importance sampling of the original pq-KL approach dis- 
cussed in [ll| . The difference is that in adaptive im- 
portance sampling the e"''"-^^^^ terms get replaced by 

Finally, consider using gp-KL distance to approximate 
g'(^) f dx'n'(I'CtK-Gi.')) rather than pq-KL distance. In 
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the Lagrangian for that distance the g' terms only con- 
tribute an overall additive constant. Aside from that con- 
stant, this qp-KL Lagrangian is identical to the Maxent 
Lagrangian. 

V. CONCLUSION 

Many problems in adaptive, distributed control can 
be cast as an iterated game. The coupling between 
the mixed strategies of the players arises as the system 
evolves from one instant to the next. This is what the 
system designer determines. Information theory tells us 
that the most likely joint strategy of the players, given 
a value of the expectation of the overall control objec- 
tive function, is the minimizer of a particular Lagrangian 
function of the joint strategy. So the goal of the sys- 
tem designer is to speed evolution of the joint strategy 
to that Lagrangian minimizing point, lower the expec- 
tated value of the control objective function, and repeat. 
Here we elaborate the theory of algorithms that do this 
using local descent procedures, and that thereby achieve 
efficient, adaptive, distributed control. 



B. Proof of claims following Lemma 1 



For generality, we provide the proofs in the general 
scenario where the private utilities gi may differ from 
one another. See the discussion in Section IlII CI 

i) Define fi{q) = J dxiqi{xi), i.e., fi is the constraint 
forcing qi to be normalized. Now for any q that equals 
zero for some joint move there must be an i and an x'^ such 
that qi{x[) = 0. Plugging into Lemma 1, we can evaluate 
the component of the direction of steepest descent along 
the direction of player i's probability of making move x'^: 



dqi{xi) dqi{xi) 

l3E{gi \xi)+ ln(gi(xi)) 

/dx^mg. |<) + lnfa«))] 



VI. APPENDIX 



This appendix provides proofs absent from the main 
text. 



A. Derivation of Lemma 1 

Proof: Consider the set of u such that the directional 
derivatives D^fi evaluated at x' all equal 0. These are 
the directions consistent with our constraints to first or- 
der. We need to find the one of those u such that D^V 
evaluated at x' is maximal. 

To simplify the analysis we introduce the constraint 
that \u\ = 1. This means that the directional derivative 
D^V for any function V is just u ■ W. We then use La- 
grange parameters to solve our problem. Our constraints 
on u are J2j ^] — ^ ^^'^ Dafi{x') = u ■ Vfi{x') = Vi. 
Our objective function is DiiV{x') = u- W{x'). 

Differentiating the Lagrangian gives 

2Aou, +^ A,;V/ = VV Vi. 

i 

with solution 

2Ao 

Ao enforces our constraint on \u\. Since we are only in- 
terested in specifying u up to a proportionality constant, 
we can set 2Ao = 1. Redefining the Lagrange parameters 
by multiplying them by —1 then gives the result claimed. 
QED. 



Since there must some x" such that qi{x") 0, 3xi such 
that /3E{gi \ x'[) + \n{qi{x")) is finite. Therefore our 
component is negative infinite. So L can be reduced by 
increasing qi{x'^. Accordingly, no q having zero prob- 
ability for some joint move x can be a minimum of i's 
Lagrangian. 



ii) To construct a bounded rational game with multiple 
equilibria, note that at any (necessarily interior) local 
minimum q, for each i, 

PE{gi \ Xt) + ln{qt{xi)) ^ 

(3 J dxi^i)gi{xi,X(^,-))Y]_qj{xj) + h-i{qi{xi)) 

must be independent of Xi, by Lemma 1. So say 
there is a component-by-component bijection T{x) = 
{Ti{xi),T2{x2), ■ ■ ■) that leaves all the {gj} unchanged, 
i.e., such that gj{x) — gj{T{x)) Va;, j 40]. 

Define q' by q'{x) — q{T{x)) Vx. Then for any two 
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values x] and xf, 

PE,,{g,\xl)+\n{ql{xl)) 

~pEg,{g,\x^) + Hqlix^)) 



(i j dx(,)5,(a;^a;(,)) J|(7j(r(a;j)) + \n{q,{T {x\))) 

- 13 j dx(o9i{x1,X(o)W_qj{T{xj))) + ln(g,(T(x^))) 



iii) To establish that at any q there is always a direction 
along which any player's Lagrangian is locally convex, fix 
all but two of the {f/i}, go and gi, and fix both go a-nd qi 
for all but two of their respective possible values, which 
we can write as 50(0), (70(1)1 9i(0), and gi(l), respectively. 
So we can parameterize the set of q we're considering by 
two real numbers, x = <Zo(0) and y = 91(0). The 2x2 
Hessian of L as a function of x and y is 



13 j dx(e)gi{xl,T ^{x(i)))'Wqj{xj) + \n{qi{T {x\))) 



1 

a — x 



a 

- + — 



f3 



dx(i)gi{xf,T ^{x(e)))\Yqj{xj)) + \n{qi{T{x1))^^ 



where a = 1 — (70(0) — 90(1) and h = 1 — (71(0) — and 
a is a function of gi and Y\j-Lo 1 <7i- Defining s = j + -^^^ 
,nd i = i + , the eigenvalues of that Hessian are 



/3 / dx^^^g,{T{x\),x^^^))W_qj{xj) + \n{q,{T {x\))) 



(3 j dx(^i)g^{T{x'i),X(^)))\\_qj{xj)) + ln((7i(T(x^))) 



infe(r(.Ti))) 

T(x?)) + Hq,{T{xm 



where the invariance of gi was used in the penultimate 
step. Since g is a local minimum though, this last differ- 
ence must equal 0. Therefore q' is also a local minimum. 

Now choose the game so that Vz, Xi, T{xi) ^ Xi. (Our 
congestion game example has this property.) Then the 
only way the transformation q — > q{T) can avoiding 
producing a new product distribution is if qi{xi) — 
qi{x[) \fi,Xi,x^, i.e., q is uniform. Say the Hessians of 
the players' Lagrangians are not all positive definite at 
the uniform q. (For example have our congestion game 
be biased away from uniform multiplicities.) Then that 
q is not a local minimum of the Lagrangians. Therefore 
at a local minimum, q ^ q{T). Accordingly, q and q{T) 
are two distinct equilibria. 



The eigenvalue for the positive root is necessarily posi- 
tive. Therefore along the corresponding eigenvector, L is 
convex at q. QED. 

iv) There are several ways to show that the value of 
E^f3{[gi]i^q^.^) must shrink as /3 grows. Here we do so by 
evaluating the associated derivative with respect to (3. 

Define N{U) = J dy e~^'^y\ the normalization con- 
stant for the distribution proportional to e~^^y\ View 
the Xi-indexed vector q^ as a function of (3, gi and 
So we can somewhat inelegantly write E{gi) — 
E i3,a X {qi)- Then one can expand 

dE{gi) _ 92in(iV(/3[g,].,,,J) 



d(3 



d(3^ 

= -Var([5r,]i,,(^)) 



where the variance is over possible Xi , sampled according 
to gf (x,). QED. 
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